The dataset we used for this project is the record of 311 Service Requests, a government hotline, from the NYC Open Data website, which reflects the daily-life problems of many residents. Link
The dataset includes the hotline's records from 2010 to the present, about a decade, covering all aspects of residents' daily life.
New Yorkers can complain by visiting NYC's online customer service, by text message, phone call, Skype, etc. The NYC 311 dataset covers all aspects of citizens' life in New York, which can be roughly divided into the following categories: Benefits & Support, Business & Consumers, Courts & Law, Culture & Recreation, Education, Employment, Environment, Garbage & Recycling, Government & Elections, Health, Housing & Buildings, Noise, Pets, Pests & Wildlife, Public Safety, Records, Sidewalks, Streets & Highways, Taxes, Transportation.
NYC311's mission is to provide the public with fast, convenient city government services and information, while providing the best customer service. It also helps organizations improve the services they offer, allowing them to focus on their core tasks and manage their workloads effectively. Meanwhile, NYC 311 also provides insights into improving city government through accurate and consistent measurement and analysis of service delivery.
Moreover,NYC311 is available 24 hours a day, 7 days a week, 365 days a year.
Not only does NYC311 offer an online translation service in more than 50 languages, but users can also call the 311 hotline in more than 175 languages if their language is not included. In addition, people who are deaf, hard of hearing, or have a speech impairment can also file complaints with special help such as the video relay service (VRS).
We believe there is a lot of information to explore in such a large and data-rich dataset.
The material we used for this explainer notebook can be found here. You can also find all source code and the dataset in the folder. Due to limited memory on a laptop, the dataset has been transformed and split into several files.
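Splitting and slimming a multi-gigabyte CSV can be done with pandas' chunked reader, which avoids loading the whole file at once. The snippet below is a minimal sketch of the idea, with an in-memory CSV standing in for the real file:

```python
import io

import pandas as pd

# Minimal sketch: stream a large CSV in chunks instead of loading it at once.
# io.StringIO stands in for the real multi-gigabyte file on disk.
csv_data = io.StringIO("Complaint Type\nNoise\nNoise\nIllegal Parking\n")

counts = pd.Series(dtype='int64')
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Aggregate per chunk, so only one chunk is ever held in memory.
    counts = counts.add(chunk['Complaint Type'].value_counts(), fill_value=0)

print(int(counts['Noise']))  # 2
```

The same pattern (aggregate or filter per chunk, then combine) is how the slimmed-down files used below could be produced.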
However, it was impossible for us to conduct a comprehensive analysis of this incredibly huge dataset, so after preliminary statistics, we chose the category with the most cumulative complaints over the past decade: Noise.
First of all, when it comes to environmental pollution, people may first think of air, soil, and water, but noise pollution, as an invisible and intangible presence, has an impact on us that equally cannot be ignored. As a serious "urban disease", noise pollution has increasingly become a focus of modern urban life. New York, as a prosperous international city, also has such problems.
We want to study the noise complaints in New York and analyze them from both a spatial and a temporal perspective. We hope to learn something about the urban conditions, economic development, residents' living conditions, traffic conditions, etc., in the five boroughs of New York through the noise complaints. Moreover, we wonder whether noise complaints can be used to describe the overall development and general condition of New York City over a 10-year period.
To begin with, we want to share interesting insights with readers from the noise analysis. The seemingly boring government complaint hotline actually contains many interesting insights, which not only reflect people's life in New York but also provide some directions and suggestions for the government to improve city services.
Also, via the analysis of noise complaints in NYC, we hope readers can understand the characteristics, living habits, preferences, and cultural backgrounds of the residents in the five boroughs of New York.
Furthermore, we hope that readers can freely access the information they find useful through the interactive map and the interactive bar chart while reading the New York stories we present, which not only increases readers' understanding but also makes reading more participatory and interesting.
import pandas as pd
import numpy as np
df_origin=pd.read_csv('311-2019-all.csv')
df=pd.read_csv('311-All-Concise-with-IncidentZip.csv')
The dataset has 22.8M rows and 41 columns, with a total size of 12 GB. The first rows are shown below.
df_origin.head(10)
The attributes are shown as follows:
df_origin.columns
We made a bar chart showing the 15 most frequent complaint types in New York during 2010~2020 to get some inspiration.
import matplotlib.pyplot as plt
complaint_count=df['Complaint Type'].value_counts()
complaint_count.iloc[0:20]
title='The 15 most frequent complaint type in New York during 2010~2020'
to_display=complaint_count[0:15]
f,p=plt.subplots(figsize=(10,8))
p.bar(to_display.index,to_display.values)
p.tick_params(axis='x',labelrotation=90)
p.tick_params(labelsize=10)
p.set_title(title,fontsize=12)
p.set_xlabel('Complaint Type',fontsize=10)
p.set_ylabel('Number of cases',fontsize=10)
plt.show()
From the figure, we found that noise is the most reported complaint type, which inspired us to explore it further. For the temporal and spatial analysis of noise, we consider only 9 attributes relevant and retained them.
df.columns
These attributes are used for different purposes.
The attributes we used directly for the report are:
Firstly, we adopt Created Date as the time when the incident happened. It has to be transformed into pandas datetime objects so that we can extract information such as the year or month.
suitform='%m/%d/%Y %I:%M:%S %p'
df['TransCDatetime']=pd.to_datetime(df['Created Date'],format=suitform)
df['month']=[i.month+(i.year-2010)*12 for i in df['TransCDatetime']]
time_nan=df['TransCDatetime'].isna()
time_nan.sum()
print('The percentage of nan value of for created time is {:10.2f}%'.format(time_nan.sum()/df.shape[0]*100))
We successfully transformed the datetime format, which indicates that all the elements are valid; no NaN values were detected in the attribute.
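A detail worth checking when parsing these timestamps: they carry an AM/PM marker, so the 12-hour directive `%I` is the one that pairs with `%p` (with the 24-hour `%H`, `strptime` ignores the marker and '08:30:00 PM' would come out as hour 8). A quick check with a made-up timestamp:

```python
import pandas as pd

# '%I' (12-hour clock) combined with '%p' (AM/PM) yields the correct hour.
sample = pd.to_datetime('01/15/2019 08:30:00 PM', format='%m/%d/%Y %I:%M:%S %p')
print(sample.hour)  # 20
```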
For the noise analysis, we focused on the 5 most frequently reported noise types. All 5 are among the top 50 reported complaint types overall.
complaint_count=df['Complaint Type'].value_counts()
TOP_COMPLAINTS=50
cared=complaint_count.iloc[0:TOP_COMPLAINTS].index
Noise_type=[]
for i in cared:
    if 'oise' in i:
        Noise_type.append(i)
Noise_type
In each main type, we also have subtypes which are shown below.
Noise_summary=dict()
for i in Noise_type:
    temp=df[df['Complaint Type']==i]
    Noise_summary[i]=temp
for i in Noise_type:
    print('The main type is', i)
    subtype=Noise_summary[i]['Descriptor'].unique()
    for j in subtype:
        print('    The subtype is', j)
In summary, we have 5 main types and 36 subtypes, all of which are considered valid. They constitute 97.8% of all noise cases, and 3.5M rows of data are sufficient to capture the overall features soundly.
main_noise=df[df['Complaint Type'].str.contains('oise', regex=False)]
counts=main_noise['Complaint Type'].value_counts()
counts=counts.iloc[0:5,]
count=0
for i in df['Complaint Type']:
    if 'oise' in i:
        count+=1
print('The number of considered noise cases is {}'.format(counts.sum()))
print('The percentage of considered noise out of all noise cases is {:10.1f}%'.format(counts.sum()/count*100))
We created a choropleth map of the distribution of noise cases across different blocks in 2019 by counting the number of cases for each zip code.
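The counting step behind the map can be sketched as follows; a toy frame stands in for the 2019 noise cases, and the folium wiring is only indicated in the comments (the GeoJSON property name depends on the boundary file used):

```python
import pandas as pd

# Toy stand-in for the 2019 noise cases with their zip codes.
cases_2019 = pd.DataFrame({'Incident Zip': ['10001', '10001', '10002', '10003', '10001']})

# Number of cases per zip code -- the quantity the choropleth colors each block by.
zip_counts = cases_2019['Incident Zip'].value_counts().rename('cases')
print(zip_counts['10001'])  # 3

# These counts could then be joined to a zip-code boundary GeoJSON and drawn with
# folium.Choropleth(..., key_on='feature.properties.<zip field>'), where the zip
# field name comes from the boundary file.
```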
In the first place, the data quality for the ten years (2010~2020) is analyzed.
df['Incident Zip'].unique()
Two main problems for the attribute Zipcode have been detected:
It is necessary to figure out the percentage of valid values. It is calculated as follows.
# verify each item if they have the following problems: nan, invalid character
import re
zipnan=df['Incident Zip'].isna()
zipnan=zipnan.to_numpy()
zipalph=[]
for i in df['Incident Zip']:
    a=(re.search('[a-zA-Z]', str(i)) is not None)
    b=(re.search('[-]', str(i)) is not None)
    zipalph.append(a and b)
zipalph=np.array(zipalph)
percentage=zipalph.sum()+zipnan.sum()
print('The percentage of invalid value of the whole dataset is {:10.2f}%'.format(percentage/df.shape[0]*100))
The percentage of invalid values is 5.79%, which is acceptable because we mainly focus on the overall distribution and trends of selected features.
However, the interactive map presents the noise distribution in 2019, so particular attention should be paid to the data for that year.
df['year']=[i.year for i in df['TransCDatetime']]
df_2019=df[df['year']==2019]
import re
zipnan1=df_2019['Incident Zip'].isna()
zipnan1=zipnan1.to_numpy()
zipalph1=[]
for i in df_2019['Incident Zip']:
    a=(re.search('[a-zA-Z]', str(i)) is not None)
    b=(re.search('[-]', str(i)) is not None)
    zipalph1.append(a and b)
zipalph1=np.array(zipalph1)
percentage=zipalph1.sum()+zipnan1.sum()
print('The percentage of invalid value for 2019 is {:10.2f}%'.format(percentage/df_2019.shape[0]*100))
We can see that the 2019 data is of better quality than the whole dataset (3.16% invalid in 2019 vs. 5.79% for 2010~2020), which indicates improvement in the government's data collection.
However, we still wanted to correct the invalid values in 2019. K-nearest neighbours (KNN) is a machine learning algorithm that can be adopted for this problem, because the zip code is determined by the coordinates of the point. Therefore, the first thing that came to our mind was to calculate the probability of invalid coordinates given an invalid zip code.
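To make the idea concrete, here is a minimal sketch of how KNN could impute a zip code from coordinates, assuming enough valid (coordinate, zip) pairs existed. The coordinates and zip codes below are toy values, and we use scikit-learn's `KNeighborsClassifier`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: rows with valid coordinates and zip codes.
valid_coords = np.array([[40.71, -74.00], [40.72, -74.01],
                         [40.80, -73.95], [40.81, -73.94]])
valid_zips = np.array(['10007', '10007', '10027', '10027'])

# Fit KNN on the valid pairs; a missing zip would be imputed from the
# majority zip among the nearest valid neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(valid_coords, valid_zips)

imputed = knn.predict([[40.715, -74.005]])[0]
print(imputed)
```

As shown in the next step, this approach fails here precisely because the rows with invalid zip codes almost never have coordinates to feed into the classifier.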
First, we should examine the coordinates. After an initial inspection, we found that the coordinates contain NaN values.
latnan1=df_2019['Latitude'].isna()
latnan1=latnan1.to_numpy()
print('The percentage of invalid value of coordinates for 2019 is {:10.2f}%'.format(latnan1.sum()/df_2019.shape[0]*100))
a=df_2019['Latitude'].isna() & df_2019['Longitude'].isna()
b=df_2019['Latitude'].isna()
print('Total number of NaN in Latitude is {}'.format(a.sum()))
print('Total number of NaN in Latitude or Longitude is {}'.format(b.sum()))
The two numbers are equal, which means that if NaN is present in Latitude, it is also present in the corresponding Longitude.
After removing the NaN values, we used boxplots to see if there are outliers in the coordinates in terms of value range.
f,p=plt.subplots(1,2,sharex=True,figsize=(20,5))
font=18
#titledict={'x':0.02,'y':0.9}
p[0].set_title('Latitude of noise cases',fontsize=font)
p[0].boxplot(df_2019[~b]['Latitude'])
p[0].tick_params(labelsize=font)
p[1].set_title('Longitude of noise cases',fontsize=font)
p[1].boxplot(df_2019[~b]['Longitude'])
p[1].tick_params(labelsize=font)
plt.show()
After removing the NaN values, all the coordinates fall within the territorial scope of New York City, so we consider that no other outliers are included.
We then calculate the probability of invalid coordinates given an invalid zip code.
notused=0
for i in range(df_2019['Incident Zip'].shape[0]):
    if latnan1[i] and zipnan1[i] and not zipalph1[i]:
        notused+=1
print('The percentage of invalid coordinates given an invalid zipcode is {:10.2f}%'.format(notused/percentage*100))
This means that for cases with an invalid zip code, the coordinates are missing 99.83% of the time. Therefore KNN will not be effective, and we can also infer that when the government did not record the zip code, they did not record the position of the case either.
Based on the above analysis, we discarded the invalid zip code values; this will not influence the conclusions of the analysis.
We created an interactive bar chart displaying the distributions of the various noise types in the different boroughs.
In the first place, the data quality for the ten years (2010~2020) is analyzed.
df['Borough'].unique()
It is shown that the only invalid value is 'Unspecified', for which we calculated the percentage in the whole dataset.
unspecified_whole=(df['Borough']=='Unspecified')
print('The percentage of invalid value of the whole dataset is {:10.2f}%'.format(unspecified_whole.sum()/df.shape[0]*100))
The percentage of invalid values is 5.35%, which makes it acceptable to discard them because we mainly focus on the overall distribution and trends of selected features. However, the interactive bar chart presents the distributions of the various noise types in the different boroughs in 2019, so particular attention should be paid to the data quality for that year.
unspecified_2019=(df_2019['Borough']=='Unspecified')
print('The percentage of invalid value for 2019 is {:10.2f}%'.format(unspecified_2019.sum()/df_2019.shape[0]*100))
We can see that the 2019 data is of better quality than the whole dataset (0.91% in 2019 vs. 5.35% for 2010~2020), which indicates improvement in the government's data collection.
Because 0.91% is quite a small percentage, we discarded the unspecified values; this will not influence the conclusions of the analysis.
Because the dataset covers a great number of complaint types, it is necessary to narrow it down to the main ones to obtain the main trends and features of noise in New York City. After data cleaning and preprocessing, the dataset only contains the attributes necessary for the report. It has 22,662,415 rows and 12 columns (10 original attributes, plus year and month as derived attributes).
df.head(10)
counts=main_noise['Complaint Type'].value_counts()
counts=counts.iloc[0:5,]
plt.figure(figsize=(8,6))
counts.plot(kind='bar')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('The sum of each main type (The 5 most frequently)',fontsize=15)
plt.xlabel('Main noise type',fontsize=12)
plt.ylabel('Number of cases',fontsize=12)
plt.show()
The most frequent main type is Noise - Residential, which shows that noise cases are mostly reported by residents. Below, we also sum up all the subtypes of noise.
sub_noise=main_noise['Descriptor'].value_counts()
plt.figure(figsize=(12,8))
sub_noise.plot(kind='bar')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('The sum of each subtype',fontsize=15)
plt.xlabel('Descriptor(Subtype of noise)',fontsize=12)
plt.ylabel('Number of cases',fontsize=12)
plt.show()
f,p=plt.subplots(len(Noise_type),figsize=(12,10),constrained_layout=True)
#f.tight_layout()
m=0
month_range=np.arange(df['month'].min(),df['month'].max()+1)
month_range_scarce=np.arange(df['month'].min(),df['month'].max()+1,5)
for i in Noise_type:
    monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
    drawn=df[df['Complaint Type']==i]['month'].value_counts()
    for j in drawn.index:
        monthly.loc[j]=drawn[j]
    p[m].bar(month_range,monthly[month_range])
    p[m].set_title(i,size=10)
    p[m].tick_params(axis='x',labelrotation=90)
    p[m].set_ylim(0,1.2*monthly.max(axis=0))
    p[m].set_xticks(month_range_scarce)
    p[m].set_xlabel('Month')
    p[m].set_ylabel('Number of cases')
    m+=1
We observe that all five main noise types show an increasing trend from 2010 to 2020 as well as seasonal fluctuation.
We can obtain more information if the monthly trend of each subtype is plotted.
m=0
n=0
f,p=plt.subplots(18,2,figsize=(70,150))
for i in Noise_type:
    subtype=Noise_summary[i]['Descriptor'].unique()
    plt.subplots_adjust(hspace=0.4)
    for j in subtype:
        monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
        drawn=Noise_summary[i][Noise_summary[i]['Descriptor']==j]['month'].value_counts()
        for k in drawn.index:
            monthly.loc[k]=drawn[k]
        p[m][n].bar(month_range,monthly[month_range])
        p[m][n].set_title(i+': '+j,size=30)
        p[m][n].tick_params(axis='x',labelrotation=90)
        p[m][n].set_ylim(0,1.2*monthly.max(axis=0))
        p[m][n].tick_params(labelsize=30)
        p[m][n].set_xticks(month_range_scarce)
        p[m][n].set_xlabel('Month',fontsize=30)
        p[m][n].set_ylabel('Number of cases',fontsize=30)
        n+=1
        if n==2:
            m+=1
            n=0
After the initial analysis, we focus only on the subtypes of noise with complete data (available for all of 2010 to 2020). Generally, they show a seasonal pattern: more cases in the summer and fewer in the winter. Besides that, we sorted them into three categories in terms of overall trend.
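The sorting can be sketched as fitting a line to each subtype's monthly counts and classifying by the slope; the tolerance below is our own illustrative choice, not a value from the analysis:

```python
import numpy as np
from scipy import stats

def classify_trend(monthly_counts, tol=0.5):
    """Classify the overall trend of a sequence of monthly counts.

    The tolerance `tol` (cases per month) deciding what counts as
    'stable' is an illustrative assumption.
    """
    months = np.arange(len(monthly_counts))
    slope, *_ = stats.linregress(months, monthly_counts)
    if slope > tol:
        return 'increasing'
    if slope < -tol:
        return 'decreasing'
    return 'stable'

print(classify_trend([10, 12, 15, 18, 22, 25]))  # increasing
print(classify_trend([25, 22, 18, 15, 12, 10]))  # decreasing
```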
from scipy.stats import gaussian_kde
main_noise=main_noise[~np.isnan(main_noise['Latitude'])]
font=18
# histogram
f,p=plt.subplots(2,1,figsize=(8,8))
f.tight_layout(pad=3.0)
p[0].hist(main_noise['Latitude'],bins=50,alpha=0.75,edgecolor = 'white', linewidth = 1.2)
p[0].tick_params(labelsize=font)
p[0].set_title('Histogram and KDE of Latitude',fontsize=font)
# KDE
density = gaussian_kde(main_noise['Latitude'])
m,n=np.histogram(main_noise['Latitude'],bins=50)
p[0].set_ylabel('Number of cases')
p[1].plot(n,density(n))
#p[1].tick_params(labelsize=font)
p[1].set_xlabel('Latitudes')
p[1].set_ylabel('Density')
plt.show()
f,p=plt.subplots(2,1,figsize=(8,8))
f.tight_layout(pad=3.0)
p[0].hist(main_noise['Longitude'],bins=50,alpha=0.75,edgecolor = 'white', linewidth = 1.2)
p[0].tick_params(labelsize=font)
p[0].set_title('Histogram and KDE of Longitude',fontsize=font)
# KDE
density = gaussian_kde(main_noise['Longitude'])
m,n=np.histogram(main_noise['Longitude'],bins=50)
p[0].set_ylabel('Number of cases')
p[1].plot(n,density(n))
#p[1].tick_params(labelsize=font)
p[1].set_xlabel('Longitudes')
p[1].set_ylabel('Density')
plt.show()
Based on the histograms, we observed how the coordinates are distributed; the pattern fits the territorial shape of New York City.
These plots analyze the relationships between the 5 noise types and the other 49 highest-ranked complaint types (245 pairings in total).
Each point in a plot is the pair of monthly totals of two complaint types. For example, in the plot of Illegal Parking and Noise - Residential, the point at (7807, 2559) means that in August 2011 there were 7807 cases of Noise - Residential and 2559 cases of Illegal Parking.
We try to find plots showing a strong or interesting relation between the five noise types and the other complaint types.
from scipy import stats
df = pd.read_csv('311-ac-monthGroupList.csv')
filename2 = '311-All-Cmplaint Type-Groupby.csv'
dfNoise = ['Noise - Residential' , 'Noise - Street/Sidewalk' ,'Noise - Commercial' ,'Noise - Vehicle','Noise']
nr = ['Illegal Parking', 'Blocked Driveway', 'Noise', 'Sewer', 'PAINT - PLASTER', 'Noise - Commercial']
ns = ['Noise - Residential', 'Illegal Parking', 'Blocked Driveway', 'HEATING', 'Traffic Signal Condition', 'Damaged Tree', 'Rodent', 'Consumer Complaint', 'New Tree Request', 'Overgrown Tree/Branches', 'Maintenance or Facility', 'Elevator', 'Root/Sewer/Sidewalk Condition']
nc = ['Noise - Residential', 'Illegal Parking', 'Blocked Driveway', 'PLUMBING', 'Water System', 'GENERAL CONSTRUCTION', 'Noise', 'Noise - Street/Sidewalk', 'Taxi Complaint']
nv = ['Noise - Residential', 'HEAT/HOT WATER', 'Street Light Condition', 'HEATING', 'Noise - Street/Sidewalk', 'UNSANITARY CONDITION', 'Traffic Signal Condition', 'Sewer', 'Dirty Conditions', 'Sanitation Condition', 'Rodent', 'Building/Use', 'Derelict Vehicles', 'Consumer Complaint', 'Graffiti', 'New Tree Request', 'Overgrown Tree/Branches', 'Maintenance or Facility', 'Elevator', 'Root/Sewer/Sidewalk Condition', 'Food Establishment']
no = ['Noise - Residential', 'HEATING', 'Noise - Street/Sidewalk', 'Traffic Signal Condition', 'Dirty Conditions', 'New Tree Request', 'Overgrown Tree/Branches', 'Root/Sewer/Sidewalk Condition', 'Illegal Parking', 'Blocked Driveway', 'PLUMBING', 'General Construction/Plumbing', 'Noise - Commercial', 'Broken Muni Meter', 'Taxi Complaint']
dfT50 = pd.read_csv(filename2).head(n=50)
# plot all plot with five noise and 49 others
for i in dfNoise:
    f1,p1 = plt.subplots(7, 7, figsize=(40,40))
    k=0
    n=0
    for j in dfT50['Complaint Type']:
        temp = df[df["Complaint Type"]==i]
        if i==j:
            continue
        temp2 = df[df["Complaint Type"]==j]
        len1 = len(temp)
        len2 = len(temp2)
        # truncate to equal length so the two monthly series align
        if len1>len2:
            temp = temp.head(n=len2)
        else:
            temp2 = temp2.head(n=len1)
        slope, intercept, r_value, p_value, std_err = stats.linregress(temp['monthSize'],temp2['monthSize'])
        p1[n][k].scatter(temp['monthSize'],temp2['monthSize'],marker='o')
        p1[n][k].plot(temp['monthSize'], intercept + slope*temp['monthSize'], 'r', label='slope: '+str(round(slope, 2)))
        p1[n][k].set_xlabel(i)
        p1[n][k].set_ylabel(j)
        p1[n][k].legend(loc="upper left")
        if k == 6:
            n=n+1
            k=0
        else:
            k=k+1
    name = i.replace('/', '-', 1)
    # f1.savefig('img/connection-'+name+'.png')
The following plots were selected because we considered that they deserve further exploration.
select = [nr,ns,nc,nv,no]
total = len(nr)+len(ns)+len(nc)+len(nv)+len(no)
# select all necessary plot
f1,p1 = plt.subplots(8, 8, figsize=(40,40))
k=0
n=0
big = 0
small = 0
s1 = []
s2 = []
for kk in range(5):
    i = dfNoise[kk]
    cat = select[kk]
    for j in cat:
        temp = df[df["Complaint Type"]==i]
        temp2 = df[df["Complaint Type"]==j]
        len1 = len(temp)
        len2 = len(temp2)
        if len1>len2:
            temp = temp.head(n=len2)
        else:
            temp2 = temp2.head(n=len1)
        slope, intercept, r_value, p_value, std_err = stats.linregress(temp['monthSize'],temp2['monthSize'])
        if slope > 0:
            big = big+1
            s1.append([i,j,slope])
        else:
            small = small+1
            s2.append([i,j,slope])
        p1[n][k].scatter(temp['monthSize'],temp2['monthSize'],marker='o')
        p1[n][k].plot(temp['monthSize'], intercept + slope*temp['monthSize'], 'r',
                      label='slope ~ R: '+str(round(slope, 2))+' ~ '+str(round(r_value, 2)))
        p1[n][k].set_xlabel(i)
        p1[n][k].set_ylabel(j)
        p1[n][k].legend(loc="upper left")
        if k == 7:
            n=n+1
            k=0
        else:
            k=k+1
print("slope>0:"+str(big))
print("slope<=0:"+str(small))
After analyzing the plots above, the results show a positive correlation between Illegal Parking and 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Commercial', and 'Noise (others)'.
Is this a coincidence, or is there indeed a connection between them in the real world? We use folium to draw a map to explore whether there is a spatial relationship.
car_noise = ['Noise - Residential']
park = 'Illegal Parking'
f1,p1 = plt.subplots(figsize=(10,5))
f1.suptitle('The relationship between Illegal Parking and Noise - Residential during 2010-2020')
big = 0
small = 0
for kk in car_noise:
    temp = df[df["Complaint Type"]==kk]
    temp2 = df[df["Complaint Type"]==park]
    len1 = len(temp)
    len2 = len(temp2)
    if len1>len2:
        temp = temp.head(n=len2)
    else:
        temp2 = temp2.head(n=len1)
    slope, intercept, r_value, p_value, std_err = stats.linregress(temp['monthSize'],temp2['monthSize'])
    p1.scatter(temp['monthSize'],temp2['monthSize'],marker='o')
    p1.plot(temp['monthSize'], intercept + slope*temp['monthSize'], 'r',
            label='slope ~ R: '+str(round(slope, 2))+' ~ '+str(round(r_value, 2)))
    p1.set_xlabel(kk)
    p1.set_ylabel(park)
    p1.legend(loc="upper left")
Here, we present Illegal Parking and Noise - Residential to see their relationship. For the folium map, we grabbed 500 random GPS sample points from each category during 2010-2020. The background heat map displays the Illegal Parking distribution, while the points are sampled from Noise - Residential.
On this map, the areas where noise complaints concentrate lie near large numbers of illegal parking complaints. This is in line with certain practical conditions: the noisy areas are likely densely populated, with scarce parking spaces.
But this phenomenon is not very clear-cut; it cannot directly explain the relationship between Illegal Parking and Noise - Residential.
import folium
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap
df_whole = pd.read_csv('311-GPS-noise-parking.csv')
# the relationship between Illegal Parking and Noise - Residential
# Map show
borough_map='Borough Boundaries.geojson'
# Do the plotting with sampling: Scatter plot
map_hooray_scatter=folium.Map(location=[40.7128, -74.0060],tiles = "Stamen Toner",zoom_start=10.5)
selected=df_whole[df_whole['Complaint Type']=='Illegal Parking'].sample(500)
selected=selected[~np.isnan(selected['Latitude'])]
cmlist=selected[['Latitude','Longitude']].values.tolist()
selected_NR=df_whole[df_whole['Complaint Type']=='Noise - Residential'].sample(500)
selected_NR=selected_NR[~np.isnan(selected_NR['Latitude'])]
cmlist_NR=selected_NR[['Latitude','Longitude']].values.tolist()
folium.GeoJson(borough_map).add_to(map_hooray_scatter)
# for i in cmlist:
# folium.CircleMarker(i,width=30, height=30, radius=3.5, weight=2.0, color='#0000CC', fill_color='#0066FF', opacity=0.75, fill_opacity=0.5).add_to(map_hooray_scatter)
for i in cmlist_NR:
    folium.CircleMarker(i, radius=3.5, weight=2.0, color='#0000CC', fill_color='#0066FF', opacity=0.75, fill_opacity=0.5).add_to(map_hooray_scatter)
HeatMap(cmlist,max_zoom=1000000,radius=20).add_to(map_hooray_scatter)
map_hooray_scatter
For this project, the focus is on statistical analysis, visualization, and storytelling. No machine learning problems are involved in the analysis, except that we planned to use K-nearest neighbours to correct the missing or invalid values in the attribute 'Incident Zip'. As described in the data cleaning section, it was impossible to implement KNN because, in most cases, both the coordinates and the zip code are missing at the same time.
For the visual narrative, we chose the interactive slideshow, which we thought would be a good way to balance author-driven and reader-driven stories. There is an overall time-based narrative structure (i.e., the slideshow); however, at some points the user can manipulate the interactive visualizations (the interactive map and the interactive bar chart in this project) to see more detailed information, so that the reader can better understand the patterns or extract more relevant information. Readers can also control the reading progression themselves. For highlighting, we provide zooming so readers can further explore the details that arouse their interest.
We selected linear ordering to form a complete story line, and hover details and selection are provided in the interactive parts. We maintain that these increase the reader's sense of participation and interactivity. For messaging, headlines, annotations, an introduction, and a summary are used. The headlines give readers guidance about the specific content of the article, while the annotations help readers get more descriptive information. The introduction arouses readers' interest and attracts them to further reading, while the summary concludes the content and stimulates readers' thinking; together they give readers a complete picture of the whole story.
It is an interactive choropleth map which shows not only the overall distribution of the reported cases but also detailed information for each block.
The color of a block indicates how many noise cases per hectare were reported in it, and readers can easily get a good understanding of the overall distribution with reference to the color bar.
Besides, when you hover over a marker and click it, you will get the zip code, block name, and number of cases per hectare.
It is an interactive bar chart that shows the distribution of the top ten noise subtypes in the five boroughs of New York.
We sorted out the top 10 noise subtypes in terms of frequency and calculated the percentage for each borough. The x axis presents the 10 noise types, while the y axis illustrates the percentage for each borough. When the mouse is moved onto a bar, it shows the accurate percentage value.
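The percentage computation can be sketched with `pandas.crosstab`; the toy frame below stands in for the cleaned noise cases, and the subtype/borough values are illustrative:

```python
import pandas as pd

# Toy stand-in for the cleaned noise cases.
noise = pd.DataFrame({
    'Descriptor': ['Loud Music/Party', 'Loud Music/Party',
                   'Banging/Pounding', 'Loud Talking'],
    'Borough': ['BROOKLYN', 'QUEENS', 'BROOKLYN', 'BRONX'],
})

# Keep the most frequent subtypes (all of them, in this toy data).
top10 = noise['Descriptor'].value_counts().head(10).index
subset = noise[noise['Descriptor'].isin(top10)]

# Share of each subtype within each borough, in percent:
# normalize='columns' makes every borough column sum to 100.
pct = pd.crosstab(subset['Descriptor'], subset['Borough'], normalize='columns') * 100
print(pct.round(1))
```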
From the interactive choropleth map and bar chart, readers can get both a general understanding of the problem and the detailed information they are interested in. Also, we provide our own story line to tell readers what we found and want them to know, with the necessary supplementary material (statistics and images) to help readers understand better. These storylines originate from the phenomena presented in the interactive visualizations. Therefore, we think they are the right tools for this report.